This article presents a morphological analyzer for the Arabic language that 
integrates several approaches together with the Viterbi algorithm. The first 
approach is based on a database of all the surface patterns of the Arabic 
language, the second is the Buckwalter Arabic morphological analyzer, and 
the third is based on a finite state automaton. Correspondence tables between 
affixes are integrated into these approaches. Combining the approaches is 
central to our analyzer. The analyzer is tested on a morphological corpus of 
200,000 words that is representative of the words of the Arabic language. The 
effectiveness of the proposed approaches is demonstrated experimentally, and 
the results obtained are comparable to the state of the art. Moreover, the 
experiments show the benefit of integrating these approaches to improve our 
morphological analyzer. 
1. INTRODUCTION 

Morphological analysis is a very important step in various natural language processing applications 
[1], [2]. Integrating several approaches in a morphological analyzer of the Arabic language is necessary 
[3]: it requires algorithms that can interpret and analyze word structure at many levels [4], such as 
processing linguistic rules, patterns of Arabic words and dictionary data. Morphological analysis is used 
in a variety of natural language processing applications [5], among them information retrieval and 
extraction, machine translation, text mining, machine synthesis and Arabic learning systems [6]. 

The morphological analysis of Arabic is difficult to automate because of the complex structure of 
words, which combine stems, infixes, prefixes, suffixes and complex patterns [7], [8]. The analysis detects 
the different morphological entities in the word and provides a morphological representation. Moreover, 
each prefix or suffix can have its own syntactic attachment; this means that the results of the 
morphological analysis stage can be used in the higher stages of Arabic processing, such as syntactic 
analysis and error processing. 
In recent years, several works on the morphological analysis of the Arabic language have been 
developed. They are generally based on one of the following approaches: the first is based on linguistic 
rules [9]-[11], the second on dictionaries [10], [11], the third on word patterns [12]-[15], the fourth on 
finite state automata [16]-[18], and the last is a hybrid approach that combines these different 
approaches [19]. 

In this work, we study approaches based on surface patterns and on finite state automata. This 
study made it possible to detect the types of errors and the strengths and weaknesses of each analyzer. 
It is then worthwhile to combine these approaches in a single analyzer in order to increase both precision 
and recall and to decrease the execution time compared to our first analyzers [20]. Several analyzers have 
been developed for the Arabic language, for example those of [10], [12], [18], [21], and combining several 
approaches is important for processing and analyzing words. In this article, we propose an integration of 
several approaches to build an Arabic morphological analyzer that deals with all cases of Arabic words. 

Sub-section 2.1 presents our morphological analyzer based on surface patterns. Sub-section 2.2 
gives an overview of our second morphological analyzer, based on a finite state automaton and the Viterbi 
algorithm [22]. Sub-section 2.3 presents the integration of several approaches in order to build a 
morphological analyzer that deals with all cases of Arabic words. Section 3 describes the tests and results 
of our morphological analyzer and the method used to evaluate the approaches of the previous sections. 
Section 4 concludes this work with some conclusions and recommendations. 


2. METHOD 
2.1. Morphological analysis based on the surface patterns 

We have developed an approach to improve our morphological analysis that is based on the surface 
patterns of Arabic words. It relies mainly on the construction of a database of surface patterns. This 
morphological analyzer determines one or more possible patterns for a given word in order to find all 
possible analyses of this word. 

Patterns make it possible to model morphological variations within words effectively and to detect 
the root of a word. Several works use this pattern-root approach, among which we cite [12], [14]. All these 
works use the classical patterns of Arabic words for morphological analysis. In our morphological 
analyzer, we use a new adapted kind of pattern that we call surface patterns. 

The database of surface patterns of Arabic words is built as follows. For example, the classical 
pattern of the word (| sts) is (!5l4), but its surface pattern is ('sl4). The algorithm we used to build the 
surface pattern of a word is the following. For a word $w = l_1 l_2 \ldots l_n$ ($l_i$ is a character of the 
word $w$) and $R$ its root, the surface pattern of $w$ is $p = f_1 f_2 \ldots f_n$ with: 

\[
f_i =
\begin{cases}
\text{one of the three letters (ف, ع, ل)} & \text{if } l_i \in R \text{ and } l_i \notin \{\text{ا}, \text{و}, \text{ي}\} \\
l_i & \text{if } l_i \notin R \text{ or } l_i \in \{\text{ا}, \text{و}, \text{ي}\}
\end{cases}
\]

Relying on the surface patterns database, for the word "Ust8" the root is "J" and the surface pattern 
is "Gslla". The surface pattern of the root $R = g_1 g_2 \ldots g_k$ ($g_i$ is a character of the root) is 
$P' = f'_1 f'_2 \ldots f'_k$ with: 

\[
f'_i =
\begin{cases}
\text{one of the three letters (ف, ع, ل)} & \text{if } g_i \text{ is a non-variant letter in } R \\
g_i & \text{otherwise}
\end{cases}
\]

A non-variant letter in a root $R$ is a letter that stays the same when generating words from that root. 
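
The construction of a surface pattern from a word and its root can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: it assumes triliteral roots, takes the three pattern letters to be ف, ع, ل, takes the variant letters to be ا, و, ي, and locates root letters greedily from left to right (a weak root letter is allowed to surface as any weak letter). The function name `surface_pattern` and the example words are hypothetical.

```python
PATTERN_LETTERS = "فعل"    # slots for the 1st, 2nd and 3rd root letter (assumed)
WEAK_LETTERS = set("اوي")  # assumed set of variant letters, kept as they appear


def surface_pattern(word: str, root: str) -> str:
    """Build the surface pattern of `word` given its (triliteral) `root`."""
    out = []
    k = 0  # index of the next root letter to locate in the word
    for ch in word:
        if k < len(root) and k < len(PATTERN_LETTERS):
            g = root[k]
            same = ch == g
            # A weak root letter may surface as another weak letter.
            both_weak = ch in WEAK_LETTERS and g in WEAK_LETTERS
            if same or both_weak:
                # Variant letters are kept; other root letters become a slot.
                out.append(ch if ch in WEAK_LETTERS else PATTERN_LETTERS[k])
                k += 1
                continue
        out.append(ch)  # affix letters are copied unchanged
    return "".join(out)
```

With these assumptions, a word whose weak root letter changes on the surface keeps that letter in its pattern, while a word built on a sound root yields the classical pattern.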
To carry out the morphological analysis of a word w with the surface patterns approach, we go 
through the following steps: 

-  Search for the surface patterns that correspond to the word w. Each candidate surface pattern m in the 
database is compared with w by a matching function f(m, w), computed over the characters m_i of the 
pattern and w_i of the word; we keep only the surface patterns having f(m, w) ≠ 0. 

-  Extraction of the surface patterns of the solution roots from the surface patterns of the analyzed 
word. 

-  Construction of the roots from the surface patterns of the roots and the word w, and verification of 
whether these roots exist in the root database. 

To test and evaluate our approach, we constructed all the surface patterns of words derived in the 
Arabic language. This step was handled by a group of Arabic linguists, who used a set of Arabic language 
references. 
Example: 
The phases of the analysis of the word "J si": 
- Search for the surface patterns corresponding to the word "Js%i". We find these surface patterns: 
P1 = "J sii"; P2 = "J snd": 
a. "Js4i" from the root "J&". 
b. "JUse4" from the root "Jsi". 
- Extraction of the surface patterns of the roots of P1 and P2. We find both surface patterns of the 
roots: SR1 = "Jl" and SR2 = "J2". 
- Construction of the roots from SR1 and SR2, using P1, P2 and the word w. We find the following root 
solutions: R1 = "J8" and R2 = "Jii". 
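
The three phases above (pattern search, extraction of the root patterns, root construction and verification) can be sketched as follows. This is a minimal sketch under assumptions that are not stated in the text: the database is represented as a dictionary mapping each surface pattern to the surface patterns of its possible roots, the letters ف, ع, ل act as slots that match any character, and all names (`matches`, `build_root`, `analyze`) and the toy data are hypothetical.

```python
SLOTS = "فعل"  # assumed pattern letters: each stands for one root letter


def matches(pattern: str, word: str) -> bool:
    """A pattern fits a word if lengths agree and every non-slot letter is equal."""
    return len(pattern) == len(word) and all(
        p in SLOTS or p == c for p, c in zip(pattern, word)
    )


def build_root(pattern: str, root_pattern: str, word: str):
    """Fill the slots of a root surface pattern with the letters of the word."""
    slot_letter = {p: c for p, c in zip(pattern, word) if p in SLOTS}
    letters = []
    for g in root_pattern:
        if g in SLOTS:
            if g not in slot_letter:
                return None  # the word pattern does not provide this root letter
            letters.append(slot_letter[g])
        else:
            letters.append(g)  # variant letter stored in the root pattern
    return "".join(letters)


def analyze(word: str, pattern_db: dict, root_dict: set) -> list:
    """Return (surface pattern, root) pairs whose root exists in the dictionary."""
    solutions = []
    for pattern, root_patterns in pattern_db.items():
        if not matches(pattern, word):
            continue
        for root_pattern in root_patterns:
            root = build_root(pattern, root_pattern, word)
            if root in root_dict:
                solutions.append((pattern, root))
    return solutions


# Hypothetical toy database and root dictionary, for illustration only.
TOY_PATTERNS = {"فال": ["فول", "فيل"], "فعل": ["فعل"]}
TOY_ROOTS = {"قول", "كتب"}
```

Calling `analyze("قال", TOY_PATTERNS, TOY_ROOTS)` keeps only the candidate whose reconstructed root is in the toy root dictionary; the other candidates are discarded at the verification step.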
Our surface-pattern-based analyzer (Figure 1) uses the following sources: 
- A lexicon of 6,216 surface patterns. This lexicon covers all the morphological classes of words 
derived in the Arabic language. 
- A root dictionary containing 1,200 roots. 
- A radical dictionary containing 6,000 radicals. 


[Figure 1 flowchart: word to be analyzed -> segmentation into prefixes and suffixes -> search for patterns 
of radicals in the base of surface patterns -> possible patterns of radicals -> extraction of possible 
roots and verification of their validity -> viewing of solutions] 

Figure 1. The steps of our morphological analyzer of Arabic words, which uses surface patterns 


2.2. Morphological analyzer based on finite state automaton 

The finite state automaton used here is an adaptive automaton that successively changes its 
structure through adaptive actions associated with the transition rules it carries out. The finite state 
automaton has great potential for natural language processing [16]. It is able to simplify and represent 
complex linguistic situations such as ambiguities and non-determinism, especially in the Arabic language. 
Additionally, a recognition formalism can be implemented to pre-process texts for a variety of scenarios 
such as morphological analysis, syntactic verification, text interpretation, automatic translation and 
computer-assisted language learning [7], [23]. The form of the finite state automaton, or adaptive 
automaton, makes it possible to process the different classes of languages. 

The finite state automaton analyzer [17], [24] is a morphological analyzer in which every word of 
the Arabic language is represented by a path over the letters of the Arabic alphabet in the automaton. To 
analyze a word, the analyzer goes through the following two steps: 
- Construction of a network of all the words of the Arabic language. 
- Search for the possible solutions of the word in this global network. 

This system relies on very restricted dictionaries and searches for the solutions in the global 
network using the Viterbi algorithm. Each word is modeled by a path in which the radical letters are 
represented by a state that loops on itself, and the affixes are represented by the characters forming them. 
Example: The words 'le=lai' , 'lalai!... are represented by the diagram in Figure 2. 


Figure 2. Finite state automaton of the words '\glalai' ,"gaslai" 


Based on all the affixes of the Arabic language, we build the global network. 
Our network is defined entirely by (Figure 3): 
- The set of all the states $Q$, which consists of all the characters composing the affixes (suffixes, 
prefixes and infixes), the state $A$, the start state $q_I$ and the final state $q_F$: 

\[
Q = \{q_I, q_F, A\} \cup \{c : c \text{ is a character of an affix}\}
\]

- The set of possible transitions linking the characters of the affixes to the states $A$, $q_I$ and $q_F$. 


Figure 3. Diagram of our global network and possible transitions between states 
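
As an illustration of how such a network could be assembled from affix lists, the following Python sketch builds a state set and a transition relation. It is a simplified reading of the description above, not the authors' code: infixes are left out, each affix gets its own chain of character states, and the state names (`qI`, `qF`, `A`, `pre:...`, `suf:...`) and the function `build_network` are hypothetical.

```python
def build_network(prefixes, suffixes, start="qI", final="qF", radical="A"):
    """Build (labels, trans) for a network of affix chains around a radical state.

    labels maps each state to the character it emits (None for the radical
    state, which stands for any radical letter); trans maps each state to the
    set of states reachable from it.
    """
    labels = {radical: None}
    # A word may have no prefix and no suffix; the radical state loops on itself.
    trans = {start: {radical}, radical: {radical, final}}

    def add_chain(affix, tag, src, dst):
        prev = src
        for k, ch in enumerate(affix):
            state = f"{tag}:{affix}:{k}"  # one state per affix character
            labels[state] = ch
            trans.setdefault(prev, set()).add(state)
            prev = state
        trans.setdefault(prev, set()).add(dst)

    for prefix in prefixes:
        add_chain(prefix, "pre", start, radical)
    for suffix in suffixes:
        add_chain(suffix, "suf", radical, final)
    return labels, trans
```

For example, `build_network(["ال"], ["ون"])` yields a start state that leads either directly to the radical state or through the two prefix characters, and a radical state that leads either to the final state or through the two suffix characters.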


How to find the possible paths for the analysis of a word: 
- To analyze a word $w$, we search the network for the different possible paths associated with $w$. 
These paths are given by: 

\[
S = \{\xi \in B : P(w \mid \xi) \neq 0\}
\]

where $B$ is the set of possible paths in our network that have the same length as $w$. 
- The solutions are the paths that can emit the word with a non-zero probability. To simplify and 
reduce the calculation, we have adapted the Viterbi algorithm to the following form (1): 

\[
\delta_t(c_j) = \underset{c_i}{\mathrm{NL}} \left( \delta_{t-1}(c_i) \cdot a_{ij} \cdot \mathbb{1}_j(w_t) \right) \tag{1}
\]

$\mathrm{NL}(x)$ is the non-zero value of $x$. We search for the states $c_i$ that give non-zero values of: 

\[
\delta_{t-1}(c_i) \cdot a_{ij} \cdot \mathbb{1}_j(w_t)
\]

$\delta_T(q_F)$ is the maximum probability of emission of the word along a given path. By a recursive 
calculation we recover all the possible paths that give non-zero values ($T$ is the length of the word). 
With: 

$a_{ij}$: the transition probability from state $c_i$ to state $c_j$, where: 

\[
a_{ij} =
\begin{cases}
1 & \text{if the transition is possible} \\
0 & \text{otherwise}
\end{cases}
\]

$w_t$: the $t$-th character of the word $w$. 

\[
\mathbb{1}_j(w_t) =
\begin{cases}
1 & \text{if } c_j = w_t \\
0 & \text{otherwise}
\end{cases}
\]

We take $\mathbb{1}_A(w_t) = 1$. 

Initialization: 

\[
\delta_0(c_j) =
\begin{cases}
1 & \text{if } c_j = q_I \\
0 & \text{if not}
\end{cases}
\]
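
The recursion can be illustrated with a short Python sketch that keeps, for every position of the word, the states with a non-zero value together with their predecessors, and then backtracks from the final state to list all paths. It is a sketch under assumptions, not the authors' implementation: transition and emission values are taken to be 0 or 1, the radical state emits any character, and the toy network and the names (`nonzero_paths`, `TOY_LABELS`, `TOY_TRANS`, `qI`, `qF`, `p1`, `s1`, ...) are hypothetical.

```python
def nonzero_paths(word, labels, trans, start="qI", final="qF"):
    """Return every state path that emits `word` with a non-zero value."""
    if not word:
        return []
    # layers[t][j] lists the predecessors i of state j at position t for which
    # delta_{t-1}(i) * a_ij * 1_j(w_t) is non-zero.
    layers = []
    prev = {start}
    for ch in word:
        layer = {}
        for i in prev:
            for j in trans.get(i, ()):
                # labels[j] is None for the radical state, which emits anything.
                if j != final and labels[j] in (None, ch):
                    layer.setdefault(j, []).append(i)
        layers.append(layer)
        prev = set(layer)

    def backtrack(t, state):
        if t == 0:
            return [[state]]
        return [path + [state]
                for i in layers[t][state]
                for path in backtrack(t - 1, i)]

    last = len(word) - 1
    return [path
            for state in prev if final in trans.get(state, ())
            for path in backtrack(last, state)]


# Hypothetical toy network: one two-letter prefix (p1, p2), one two-letter
# suffix (s1, s2) and the radical state A that loops on itself.
TOY_LABELS = {"A": None, "p1": "ا", "p2": "ل", "s1": "و", "s2": "ن"}
TOY_TRANS = {
    "qI": {"p1", "A"},
    "p1": {"p2"},
    "p2": {"A"},
    "A": {"A", "s1", "qF"},
    "s1": {"s2"},
    "s2": {"qF"},
}
```

On this toy network a word beginning with the prefix letters is returned with two paths, one that reads those letters as a prefix and one that reads every letter as a radical. This ambiguity is what the later verification against the correspondence table and the dictionary of radicals is meant to resolve.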


The test was carried out on 20,000 words representing different grammatical categories (verbs and 
nouns). 96% of these words were correctly analyzed, with our finite state automaton analyzer proposing the 
different possible analyses for them, while it failed to do so for the remaining 4%. 95% of these errors 
were due to the analyzer not taking into account the links between prefixes, roots and suffixes. The steps 
of the analyzer are shown in Figure 4. 


[Figure 4 flowchart: the list of suffixes, prefixes and infixes is used to construct the global network; 
the word to analyze undergoes morphological analysis, which returns all the possible paths associated with 
the word (Viterbi algorithm); these are verified against the correspondence table and the dictionary of 
radicals to give the final results] 

Figure 4. Steps of our morphological analyzer based on finite state automata 


Integrated approaches in a morphological analyzer of the Arabic language (Said Iazzi) 


304 0 ISSN: 2502-4752 


To evaluate our approach based on finite state automata, we created the different dictionaries of 
suffixes, prefixes and radicals. We then integrated the correspondence tables between the affixes of the 
Arabic language. Then, from the list of prefixes, suffixes and infixes, we generated a global network of 
states as indicated previously, without using lexical dictionaries. This is the main advantage of our 
analyzer over the Buckwalter Arabic morphological analyzer and the analyzer based on finite state automata. 


2.3. Integration of approaches to improve our analyzer 

Our contribution is a morphological analyzer of the Arabic language that integrates and combines 
approaches in order to minimize the error rate. To this end, we have combined an approach that relies on 
surface patterns with an approach that uses a finite state automaton to analyze a given word. 

Integrating and combining these approaches has many advantages. One is the reduction of the size 
of the dictionaries used, unlike other analyzers. Another is that the analyzer based on surface patterns 
uses the base of surface patterns and models all the morphological variations of the derived words [20], 
[25]. The analyzer based on a finite state automaton, on the other hand, relies only on the base of roots. 
In this approach, we generated a network of states without using lexical dictionaries. This is the main 
advantage of our analyzer over the Buckwalter analyzer and the surface-pattern-based analyzer. 

In the finite state automaton approach, the morphological analysis processes all words, unlike 
the pattern approach, which only gives the analyses associated with the surface patterns that exist in the 
database of patterns. The finite state automaton approach does not require a base of linguistic knowledge 
to perform the morphological analysis of a word. However, these approaches have a few drawbacks. The 
surface pattern approach deals only with derived words, while the finite state automaton approach 
processes even non-derived words. 


3. RESULTS AND DISCUSSION 

The proposed adaptation stems from our study of a number of perspectives resulting from the words 
analyzed by our morphological analyzers of Arabic words. In this work, we have improved our morphological 
analyzer by integrating several approaches. These analyzers are essentially based on three concepts: 

- The surface pattern [25]; 

- The Buckwalter Arabic morphological analyzer [10], used with correspondence tables between Arabic 
language affixes; 

- A finite state automaton [24]. 

A corpus of words from the Arabic language, which is developed by a group of linguists, is chosen 
as a grammatical support because it uses the same notation of the Arabic language, which is considered 
standard for natural language processing [17], [18]. 

The morphological analyzer has been implemented according to the structure of a network of finite 
state automata. The database of surface patterns is made up of correspondence tables between affixes. The 
Viterbi algorithm was used to search our network. 

The morphological analyzer was trained on 95% of the morphological corpus, which covers all 
categories, and tested on the remaining 5%, which contains only the categories missed in training. The test 
dataset is only 5% of the corpus, but it contains a significant volume of words that were never used for 
training. 

As a result, the morphological analyzer showed an accuracy of 95.09%. The results obtained by our 
analyzer are comparable to those of analyzers in general, but lower than those of specialized Arabic 
analyzers, which reach an accuracy of 97.09% (Table 1). Nevertheless, there are several possibilities for 
improvement: it is possible to increase the size of the test morphological corpus and to include more 
contextual information for disambiguation. 


Table 1. Accuracy of each morphological analyzer 
Morphological analyzer                        Accuracy 
Our surface-pattern based analyzer            94.41% 
Our finite state automaton-based analyzer     95.09% 
Buckwalter Arabic morphological analyzer      93.87% 
Our adaptive approaches                       97.09% 
4. CONCLUSION 

This article presents the concept of a finite state automaton and the integration of correspondence 
tables between affixes into the analysis approach. Moreover, it shows the importance of using surface 
patterns, detailing their working mechanism, their main areas of application and their importance in the 
field of natural language processing. The effectiveness of our proposed approach has been demonstrated 
experimentally. Our morphological analyzer obtained results comparable to general analyzers, but lower 
than those reported for specialized Arabic language analyzers. Improvements are possible, such as 
increasing the size of the morphological corpus for disambiguation. Finally, we hope to improve the 
architecture of adaptive approaches to the morphological analysis of Arabic words by incorporating 
mechanisms for choosing the types of computation models, with concrete evaluation criteria and clear 
transition rules. 